The focus during problem investigation is to diagnose the root cause of the problem and arrive at a workaround that restores
the service as quickly as possible. The speed and nature of the investigation depend largely on the impact, severity and
urgency of the problem. Problem Management must ensure that the right level of expertise and resources is available to
execute this task. The time taken to arrive at the root cause and the solution must be in line with the defined Service
Level Agreements (SLAs).
The Configuration Management Database (CMDB) should be leveraged to determine the level of impact and diagnose the exact
point of failure. It can also be used to investigate the context in which the problem occurred and thereby find a solution
faster. It is often necessary to recreate the failure to understand what has gone wrong, and then to try various ways of
finding the most appropriate and cost-effective resolution. To proceed effectively without causing further disruption to
users, a "test" environment can be used to recreate the problem.
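As an illustration of how CMDB relationships can support impact assessment, the following minimal sketch walks a dependency map to list everything affected by a failed configuration item (CI). The `DEPENDENTS` data, the CI names and the `impacted_items` function are hypothetical; a real CMDB would be queried through its own interface.

```python
from collections import deque

# Hypothetical CMDB extract (an assumption for illustration): each CI maps
# to the CIs that depend on it directly.
DEPENDENTS = {
    "db-server-01": ["orders-db"],
    "orders-db": ["order-service", "reporting-service"],
    "order-service": ["web-shop"],
    "reporting-service": [],
    "web-shop": [],
}


def impacted_items(failed_ci, dependents):
    """Return every CI that depends, directly or indirectly, on failed_ci."""
    seen = set()
    queue = deque([failed_ci])
    # Breadth-first walk over the "is depended on by" relationships.
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    # Exclude the failed CI itself in case the data contains a cycle.
    seen.discard(failed_ci)
    return sorted(seen)


print(impacted_items("orders-db", DEPENDENTS))
# prints ['order-service', 'reporting-service', 'web-shop']
```

The same walk run in the opposite direction (from an affected service down to the items it relies on) gives a list of candidate points of failure to examine.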
Root Cause Analysis (RCA) is based on several key analytical concepts and principles, including establishing success
conditions, cause/effect relationships, data quality and risk analysis. Problem-solving techniques commonly used include
chronological analysis, pain value analysis, Kepner-Tregoe analysis, brainstorming, Ishikawa diagrams and Pareto
analysis.
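Of the techniques listed, Pareto analysis lends itself to a short worked example: it ranks causes by frequency and isolates the few that account for most incidents. The sketch below assumes a hypothetical list of recorded incident causes and the conventional 80% cutoff; the `pareto` function and the sample data are illustrative only.

```python
from collections import Counter

# Hypothetical incident records (assumption): one cause label per incident.
INCIDENT_CAUSES = (
    ["disk full"] * 5
    + ["expired certificate"] * 3
    + ["config drift"]
    + ["network timeout"]
)


def pareto(causes, threshold=0.8):
    """Return the most frequent causes that together reach `threshold`
    of all incidents, as (cause, count, cumulative_share) tuples."""
    counts = Counter(causes)
    total = sum(counts.values())
    vital_few = []
    cumulative = 0
    # most_common() yields causes from most to least frequent.
    for cause, count in counts.most_common():
        cumulative += count
        vital_few.append((cause, count, cumulative / total))
        if cumulative / total >= threshold:
            break
    return vital_few


for cause, count, share in pareto(INCIDENT_CAUSES):
    print(f"{cause}: {count} incidents, cumulative share {share:.0%}")
```

With this sample data, two of the four causes cover 80% of the incidents, which suggests where investigation effort would be best spent first.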
Root Cause Analysis must be formally documented as part of the Service Engagement document. This document should contain
the details of the root cause investigation, contributing factors, observations and proposed solutions, as well as the
preventive and corrective actions. Root Cause Analysis results are shared with the Client for further discussion and approval.